Fault Tolerance in Message Passing Interface Programs
Abstract
In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI) applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that, within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
Similar resources
Fault Tolerance in MPI Programs
This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that within certain constraints, ...
In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes
Today, the scale of high-performance computing (HPC) systems is much larger than ever. This poses a challenge for the fault tolerance of HPC systems. MPI (Message Passing Interface) is one of the most important programming tools for HPC. There are quite a few fault-tolerant extensions for MPI, such as MPICH-V, StarFish, and FT-MPI. Most of them are based on on-disk checkpointing. In this pape...
Message Relaying Techniques for Computational Grids and their Relations to Fault Tolerant Message Passing for the Grid
In order to execute Message Passing distributed applications on a computational grid without modification, one has to address many issues. The first is how to let processes of two different clusters communicate. In this work, we study the performance of relaying techniques (passing messages to a middle tier) to solve this issue. When using relays, messages and most of the nondeterministic...
Can Agent Intelligence be used to Achieve Fault Tolerant Parallel Computing Systems?
The work reported in this paper is motivated towards validating an alternative approach for fault tolerance over traditional methods like checkpointing that constrain efficacious fault tolerance. Can agent intelligence be used to achieve fault tolerant parallel computing systems? If so, “What agent capabilities are required for fault tolerance?”, “What parallel computational tasks can benefit f...
Journal title:
- IJHPCA
Volume 18, Issue
Pages -
Publication date: 2004